
Conversation

@julienrbrt
Member

@julienrbrt julienrbrt commented Feb 10, 2026

Overview

Skip draining when exec client unavailable.
Finally put my latest read into practice (https://go.dev/blog/synctest), as it was relevant here.

@github-actions
Contributor

github-actions bot commented Feb 10, 2026

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

Build      Format      Lint       Breaking   Updated (UTC)
✅ passed  ⏩ skipped  ✅ passed  ✅ passed  Feb 10, 2026, 11:02 AM

@claude
Contributor

claude bot commented Feb 10, 2026

Claude finished @julienrbrt's task in 1m 50s.


Code Review Complete

I've performed a comprehensive review of the syncing shutdown fix. Here's my detailed analysis:


Summary

This PR implements a fix to prevent the syncer from attempting to drain pending events during shutdown when a critical error (execution client unavailable) has occurred. The implementation is sound and well-tested, with good use of atomic operations and the new testing/synctest package.


Positive Aspects ✅

1. Clean Error Handling Architecture

  • The introduction of hasCriticalError atomic.Bool is a clean, race-free way to track critical failures
  • The sendCriticalError() function properly sets the flag and attempts a non-blocking channel send (sketched after this list)
  • Error wrapping with context is consistent (fmt.Errorf("failed to execute transactions: %w", err))
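
A minimal sketch of this pattern, assuming a simplified struct layout (the real Syncer has more fields; only hasCriticalError and the error channel come from the review, and the field name errorCh is an assumption):

import (
    "sync/atomic"
)

type Syncer struct {
    hasCriticalError atomic.Bool
    errorCh          chan error
}

// sendCriticalError marks the syncer as failed and forwards the error
// without ever blocking the caller.
func (s *Syncer) sendCriticalError(err error) {
    s.hasCriticalError.Store(true) // set the flag before signalling
    select {
    case s.errorCh <- err: // deliver if a receiver is ready
    default: // otherwise drop: the flag alone is enough for the shutdown logic
    }
}

Because Store and Load are atomic, concurrent callers need no mutex, and Stop() can read the flag from any goroutine.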

2. Excellent Test Coverage

  • Three new test cases covering the critical paths:
    • TestProcessHeightEvent_ExecutionFailure_DoesNotReschedule - verifies events aren't re-queued after execution failure
    • TestSyncer_Stop_SkipsDrainOnCriticalError - confirms drain is skipped when critical error flag is set
    • TestSyncer_Stop_DrainWorksWithoutCriticalError - ensures normal drain still works
  • Good use of synctest.Test() wrapper for time-dependent tests (lines 543, 806, 945 in syncer_test.go); a sketch of the pattern follows this list
  • All existing tests updated to use synctest where appropriate
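
A minimal sketch of the synctest pattern referenced above (the test name and durations are illustrative, not taken from the PR):

import (
    "testing"
    "testing/synctest"
    "time"
)

func TestDrainFinishesBeforeTimeout(t *testing.T) {
    synctest.Test(t, func(t *testing.T) {
        done := make(chan struct{})
        go func() {
            time.Sleep(5 * time.Second) // fake clock: advances instantly once all goroutines block
            close(done)
        }()
        select {
        case <-done:
            // drain completed within the window
        case <-time.After(10 * time.Second):
            t.Fatal("drain did not finish before the timeout")
        }
    })
}

Inside the synctest bubble time is virtual, so this completes in microseconds of real time while still exercising the 5-second path deterministically.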

3. Concurrency Safety

  • Proper use of atomic.Bool for the critical error flag (syncer.go:121)
  • No race conditions in the shutdown logic
  • The conditional drain check (syncer.go:260) is thread-safe

4. Graceful Degradation

  • The 5-second drain timeout is preserved for normal shutdown (syncer.go:261); see the sketch after this list
  • Events that fail validation are still properly categorized (syncer.go:688-698)
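
A sketch of the conditional drain as the review describes it, building on the earlier Syncer sketch; heightInCh and processHeightEvent are named in this thread, but the helper name and both signatures are assumptions:

import (
    "context"
    "time"
)

func (s *Syncer) drainPendingEvents() {
    if s.hasCriticalError.Load() {
        return // execution client is unavailable; pending events would only fail again
    }
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()
    for {
        select {
        case event := <-s.heightInCh:
            s.processHeightEvent(ctx, event) // hypothetical signature
        case <-ctx.Done():
            return // drain window expired
        default:
            return // channel empty, drain complete
        }
    }
}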

Issues & Concerns ⚠️

1. CRITICAL: Removal of Shutdown Timeout Safety Net

Location: pkg/cmd/run_node.go:226-235

The PR removes the 5-second timeout from the shutdown wait logic:

// Before (implicit from removed code):
case <-time.After(5 * time.Second):
    // ... timeout handling

// After:
case <-time.After(10 * time.Second):
    return fmt.Errorf("shutdown timeout exceeded")

On closer inspection, the change replaces the implicit handling with an explicit 10-second timeout. The remaining issue is that this timeout only covers the node shutdown wait, not all components.

Concern: While the syncer's drain logic now intelligently skips when there's a critical error, other components might still hang during shutdown. The gemini-code-assist reviewer correctly flagged this.

Recommendation: Consider adding a fallback timeout mechanism or ensure all components have their own shutdown timeouts. The 10-second timeout is reasonable, but document why this value was chosen.


2. Incomplete Error Classification

Location: block/internal/syncing/syncer.go:688-698

The logic that decides whether to reschedule events has this pattern:

switch {
case errors.Is(err, errInvalidBlock) || s.hasCriticalError.Load():
    // do not reschedule: the block is invalid, or a critical error already occurred
case errors.Is(err, errMaliciousProposer):
    s.sendCriticalError(...) // unrecoverable: surface to the caller
case errors.Is(err, errInvalidState):
    s.sendCriticalError(...) // unrecoverable: surface to the caller
default:
    s.cache.SetPendingEvent(height, event) // transient failure: retry later
}

Issue: The code checks hasCriticalError.Load() AFTER TrySyncNextBlock returns, but sendCriticalError is called inside ApplyBlock (line 818). This means:

  1. ApplyBlock fails and calls sendCriticalError (sets flag to true)
  2. Control returns to processHeightEvent line 681
  3. The error check on line 688 sees hasCriticalError.Load() == true
  4. Event is not rescheduled

This works correctly, but the logic could be clearer. The flag is being used both to skip draining AND to prevent rescheduling.

Minor suggestion: Add a comment explaining that hasCriticalError is checked here because it may have been set during the TrySyncNextBlock call.

3. Missing Error Type Definition

Location: block/internal/syncing/syncer.go:703-708

The comment in the overview mentions introducing errExecutionClientFailure, but I don't see this error defined in the code. Looking at the logic:

var (
    errInvalidBlock = errors.New("invalid block")
    errInvalidState = errors.New("invalid state")
)

There's no errExecutionClientFailure sentinel error. This contradicts the PR description.

However, looking at the implementation, the code doesn't need a specific error type because it uses the hasCriticalError flag instead. This is actually cleaner than wrapping errors.

Recommendation: Update the PR description to remove the mention of errExecutionClientFailure since it wasn't implemented (or clarify why it wasn't needed).

4. Test Coverage Gap

Coverage: 62.50% patch coverage with 9 uncovered lines (per the Codecov report below)

Looking at the test file, the new tests are excellent, but there are some edge cases not covered:

  • What happens if sendCriticalError is called multiple times concurrently?
  • What if the drain context times out while processing an event?
  • What if Stop() is called multiple times?

These are edge cases that might not be worth testing, but they're worth considering for robustness; a sketch of a test for the concurrent-call case follows.
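
A hypothetical test for the first edge case, reusing the sendCriticalError sketch from earlier (the buffer size and iteration count are arbitrary):

import (
    "errors"
    "sync"
    "testing"
)

func TestSendCriticalError_ConcurrentCalls(t *testing.T) {
    s := &Syncer{errorCh: make(chan error, 1)}
    var wg sync.WaitGroup
    for i := 0; i < 16; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            s.sendCriticalError(errors.New("exec client down"))
        }()
    }
    wg.Wait()
    if !s.hasCriticalError.Load() {
        t.Fatal("flag must be set after any call")
    }
    // only one error fits the buffer; the rest are dropped by the non-blocking send
}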

The fact that pkg/cmd/run_node.go has 0% coverage for the changed lines is expected (integration-level code).


Code Quality Assessment

Architecture ⭐⭐⭐⭐⭐

  • Clean separation of concerns
  • Good use of atomic operations for shared state
  • Proper error propagation through channels

Concurrency Safety ⭐⭐⭐⭐⭐

  • All shared state properly protected with atomics
  • No data races (atomic.Bool, atomic.Uint64, atomic.Pointer)
  • Proper context cancellation handling

Error Handling ⭐⭐⭐⭐

  • Consistent error wrapping with %w
  • Good distinction between recoverable and critical errors
  • Minor: Could benefit from more specific error types (but current approach is pragmatic)

Testing ⭐⭐⭐⭐⭐

  • Comprehensive test coverage of the critical path
  • Good use of synctest for time-dependent tests
  • Tests verify both positive and negative cases

Documentation ⭐⭐⭐

  • Good inline comments explaining the drain skip logic (syncer.go:258-259)
  • Changelog entry is clear
  • Missing: godoc for hasCriticalError field explaining its purpose

Go Best Practices Review

✅ Follows Best Practices:

  1. Atomic operations: Correct use of atomic.Bool for flag (no mutexes needed for simple boolean)
  2. Error handling: Proper error wrapping with %w
  3. Testing: Table-driven tests with synctest for deterministic time-based testing
  4. Context handling: Proper context propagation and cancellation
  5. Channel operations: Non-blocking send to error channel (line 1143)

⚠️ Minor Observations:

  1. Magic numbers: The 5-second drain timeout (line 261) could be a named constant
  2. Context cleanup: The drain context (line 261-262) is properly cleaned up with defer
  3. Nil checks: Proper nil check for cancel function (line 250)

Security Considerations

✅ No Security Issues Found

  • No credential exposure in logs
  • Proper error message sanitization
  • No potential for deadlocks in the new code
  • Atomic operations prevent data races

Performance Considerations

✅ Performance Impact: Minimal

  • Atomic operations are very fast (single CPU instruction)
  • Skip draining on critical error actually improves shutdown performance
  • No additional goroutines or allocations introduced

Recommendations

High Priority:

  1. Document the shutdown timeout decision - Add a comment explaining why 10 seconds is sufficient or adjust based on observed component shutdown times

Medium Priority:

  1. Add godoc for hasCriticalError - Explain it's used both for skipping drain and preventing event rescheduling
  2. Update PR description - Remove mention of errExecutionClientFailure or clarify why it wasn't implemented as a sentinel error

Low Priority:

  1. Named constant for drain timeout - Make 5*time.Second a named constant like drainTimeout
  2. Consider additional edge case tests - Concurrent sendCriticalError calls, multiple Stop() calls

Final Verdict: LGTM with Minor Suggestions

This is a well-implemented fix that solves the shutdown hanging problem elegantly. The use of synctest from the Go blog post (https://go.dev/blog/synctest) shows good awareness of modern Go testing patterns. The test coverage is comprehensive for the critical paths.

The main concern raised by the gemini-code-assist reviewer about removing the timeout is valid, but the code actually increases the timeout to 10 seconds and makes it explicit, which is reasonable. However, verifying that all components shut down within this window would be prudent.

The code is production-ready with the minor documentation improvements suggested above.


Branch: julien/fix-shutdown

@gemini-code-assist
Contributor

Summary of Changes

Hello @julienrbrt, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refines the Syncer component's shutdown process by introducing a mechanism to gracefully handle critical errors, specifically when the execution client becomes unavailable. By tracking a hasCriticalError flag, the system can now intelligently skip the draining of pending events during shutdown, thereby preventing redundant and failing operations and ensuring a more responsive and efficient termination in error states. This change improves the overall robustness and user experience during node shutdowns under adverse conditions.

Highlights

  • Conditional Drain on Shutdown: The Syncer.Stop() method now conditionally skips draining the heightInCh event channel if a critical error, such as an unavailable execution client, has occurred. This prevents unnecessary processing of events that are likely to fail during shutdown.
  • Critical Error Tracking: An atomic.Bool field, hasCriticalError, was added to the Syncer struct to track whether a critical error has been encountered. This flag is set when sendCriticalError is called.
  • New Error Type for Execution Client Failure: A new error, errExecutionClientFailure, was introduced to specifically identify issues where the execution client is unavailable after all retries. This error is used to trigger the conditional draining logic.
  • Improved Error Handling: The Syncer.ApplyBlock method now wraps execution client errors with errExecutionClientFailure and sets the hasCriticalError flag. Additionally, Syncer.processHeightEvent was updated to avoid rescheduling events if errExecutionClientFailure is detected.
  • Enhanced Testing: New test cases were added to verify that events are not rescheduled after execution client failures and that the Syncer.Stop() method correctly skips draining when a critical error is present, while also confirming normal draining behavior in its absence.
  • Removed Fixed Shutdown Timeout: A fixed 5-second timeout was removed from the node's shutdown wait logic in pkg/cmd/run_node.go, as the Syncer's shutdown behavior is now more intelligently handled.


Changelog
  • block/internal/syncing/syncer.go
    • Added hasCriticalError atomic boolean field to Syncer struct.
    • Modified Stop() method to check hasCriticalError before initiating the drain loop for heightInCh.
    • Defined a new error errExecutionClientFailure to signify an unavailable execution client after retries.
    • Updated processHeightEvent to prevent rescheduling of events if errExecutionClientFailure is encountered.
    • Modified ApplyBlock to wrap execution client errors with errExecutionClientFailure and set hasCriticalError.
    • Ensured sendCriticalError sets the hasCriticalError flag.
  • block/internal/syncing/syncer_test.go
    • Imported testing/synctest for improved test context management.
    • Wrapped TestSyncer_executeTxsWithRetry with synctest.Test.
    • Added TestProcessHeightEvent_ExecutionFailure_DoesNotReschedule to verify non-rescheduling of events on execution failure.
    • Added TestSyncer_Stop_SkipsDrainOnCriticalError to confirm drain skipping during critical errors.
    • Added TestSyncer_Stop_DrainWorksWithoutCriticalError as a sanity check for normal drain operation.
  • pkg/cmd/run_node.go
    • Removed a hardcoded 5-second timeout from the node's shutdown wait select statement.


Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a fix to prevent the syncer from attempting to drain pending events during shutdown when a critical error, such as an unavailable execution client, has occurred. This is achieved by adding an atomic.Bool flag hasCriticalError to the Syncer. This flag is set when a critical error is reported, and checked within the Stop() method to conditionally skip the draining logic. The changes also ensure that events failing due to execution client unavailability are not rescheduled. The accompanying tests are thorough and validate the new behavior correctly.

One point of concern is the removal of the shutdown timeout in pkg/cmd/run_node.go. While the primary cause of shutdown hangs is addressed, removing the timeout completely could lead to indefinite hangs if other parts of the system fail to shut down gracefully. I've added a comment suggesting to reconsider this change.

@codecov

codecov bot commented Feb 10, 2026

Codecov Report

❌ Patch coverage is 62.50000% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.83%. Comparing base (dd87ec9) to head (56bbff5).
⚠️ Report is 1 commits behind head on main.

Files with missing lines           Patch %   Lines
block/internal/syncing/syncer.go   75.00%    4 Missing and 1 partial ⚠️
pkg/cmd/run_node.go                 0.00%    4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3060      +/-   ##
==========================================
+ Coverage   61.61%   61.83%   +0.21%     
==========================================
  Files         111      111              
  Lines       11120    11123       +3     
==========================================
+ Hits         6852     6878      +26     
+ Misses       3530     3506      -24     
- Partials      738      739       +1     
Flag       Coverage Δ
combined   61.83% <62.50%> (+0.21%) ⬆️



@julienrbrt julienrbrt marked this pull request as ready for review February 10, 2026 08:58
Contributor

@tac0turtle tac0turtle left a comment


utACK

@julienrbrt julienrbrt enabled auto-merge February 10, 2026 11:02
@julienrbrt julienrbrt added this pull request to the merge queue Feb 10, 2026
Merged via the queue into main with commit fa3f0e2 Feb 10, 2026
44 of 47 checks passed
@julienrbrt julienrbrt deleted the julien/fix-shutdown branch February 10, 2026 12:14
